library(tidyverse)
library(gridExtra)
library(emmeans)
library(ggpubr)
library(kableExtra)
library(grid)
options(width=100)
Submitted by: <2284896>
Date Sent: 15 January 2023
Module Title: Business Statistics
Module Code: IB94X0
Date/Year of Module: Term One 2022-2023
Submission Deadline: 12:00 (UK time), Monday 16 January 2023
Word Count: —-
Number of Pages: 10
Question: [ [80%] Assignment 2 ]
I declare that this work is entirely my own in accordance with the University’s Regulation 11 and the WBS guidelines on plagiarism and collusion. All external references and sources are clearly acknowledged and identified within the contents.
| Variable | Description |
|---|---|
| Country | Country (England, Northern Ireland, Wales) [Column 1] |
| LAType | Local authorities type [Column 2] |
| LAName | Local authorities name [Column 3] |
| Totalestablishments(includingnotyetrated&outside) | Count of total establishments (including not yet rated and outside) [Column 4] |
| Establishmentsnotyetratedforintervention | Count of not yet rated establishments for intervention [Column 5] |
| Total%ofBroadlyCompliantestablishments-A | Total percent of broadly compliant establishments (rated A) [Column 10] |
| Total%ofInterventionsachieved(premisesratedA-E) | Total percent of successful interventions of establishments (rated A-E) [Column 19] |
| Total%ofInterventionsachieved-premisesratedA | Total percent of successful interventions of establishments(rated A) [Column 20] |
| Total%ofInterventionsachieved-premisesratedB | Total percent of successful interventions of establishments(rated B) [Column 21] |
| Total%ofInterventionsachieved-premisesratedC | Total percent of successful interventions of establishments(rated C) [Column 22] |
| Total%ofInterventionsachieved-premisesratedD | Total percent of successful interventions of establishments(rated D) [Column 23] |
| Total%ofInterventionsachieved-premisesratedE | Total percent of successful interventions of establishments(rated E) [Column 24] |
| Total%ofInterventionsachieved-premisesnotyetrated | Total percent of successful interventions of not yet rated establishments [Column 25] |
| ProfessionalFullTimeEquivalentPosts-occupied * | Number of employees [Column 36] |
__*Note: new columns will be created during the following analysis.*__
data<-read_csv("Desktop/BS_FInal/2019-20-enforcement-data-food-hygiene.csv")
#str(data) #check the structure of the data.
#summary(data) #check the data overview.
#The column `Total%ofInterventionsachieved-premisesratedA` contains the non-numeric value "NR", so we coerce it to numeric ("NR" becomes NA).
data$`Total%ofInterventionsachieved-premisesratedA`<-as.numeric(data$`Total%ofInterventionsachieved-premisesratedA`)
#check NAs in each column of the data
for(i in colnames(data)){print(table(is.na(data[,i])))}
## Country, LAType, LAName:  FALSE 353 (no NAs)
## All other columns except column 20:  FALSE 347, TRUE 6
## Column 20 (`Total%ofInterventionsachieved-premisesratedA`):  FALSE 323, TRUE 30 ("NR" values coerced to NA, plus the 6 all-NA rows)
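The loop above prints one table per column; the same per-column NA counts can be obtained in a single line. A minimal sketch on a toy data frame (the real data would take the place of `toy`, whose values are purely hypothetical):

```r
library(tibble)
library(dplyr)

# Toy data frame standing in for the enforcement data (hypothetical values)
toy <- tibble(
  Country = c("England", "Wales", NA),
  Total   = c(100, NA, 250)
)

# One number per column: how many NAs it contains
na_counts <- colSums(is.na(toy))
print(na_counts)

# tidyverse equivalent
toy %>% summarise(across(everything(), ~ sum(is.na(.))))
```

Either form gives a compact overview that is much easier to scan than one printed table per column.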
#To see what happened in those rows with NAs
data[is.na(data[[4]]), ]
## # A tibble: 6 × 36
## Country LAType LAName Total…¹ Estab…² Estab…³ Total…⁴ Total…⁵ Arate…⁶ Total…⁷ Brate…⁸ Total…⁹
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 England District C… Broxb… NA NA NA NA NA NA <NA> NA NA
## 2 England District C… Chorl… NA NA NA NA NA NA <NA> NA NA
## 3 England District C… East … NA NA NA NA NA NA <NA> NA NA
## 4 England District C… Hyndb… NA NA NA NA NA NA <NA> NA NA
## 5 England District C… Tamwo… NA NA NA NA NA NA <NA> NA NA
## 6 England Metropolit… West … NA NA NA NA NA NA <NA> NA NA
## # … with 24 more variables: Cratedestablishments <dbl>,
## # `Total%ofBroadlyCompliantestablishments-C` <dbl>, Dratedestablishments <dbl>,
## # `Total%ofBroadlyCompliantestablishments-D` <dbl>, Eratedestablishments <dbl>,
## # `Total%ofBroadlyCompliantestablishments-E` <dbl>,
## # `Total%ofInterventionsachieved(premisesratedA-E)` <dbl>,
## # `Total%ofInterventionsachieved-premisesratedA` <dbl>,
## # `Total%ofInterventionsachieved-premisesratedB` <dbl>, …
#All NAs occur in the same six rows, so we drop them with na.omit()
newdata<-na.omit(data)
#After the as.numeric() coercion the "NR" entries are already NA, so this filter is only a safety net
newdata<-newdata%>%filter(`Total%ofInterventionsachieved-premisesratedA`!="NR")
Create a plot that clearly shows the distribution across the Local Authorities (LAs) of the % of enforcement actions successfully achieved. The plot below shows the distribution across the Local Authorities of the % of enforcement actions successfully achieved for all impact levels combined.
I use a jitter plot with alpha transparency to show the distribution: the darker a region, the higher the density of LAs at that percentage.
overall<-ggplot(newdata)+
geom_jitter(aes(LAName,`Total%ofInterventionsachieved(premisesratedA-E)`),color="red",alpha=0.2)+
facet_wrap(~LAType)+
labs(title = "The % of enforcement achieved",
x="Each point represents one LA; panels show LA types",
y="Percentage achieved(%)")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
overall
rankA<-ggplot(newdata)+
geom_jitter(aes(LAName,`Total%ofInterventionsachieved-premisesratedA`),color="red",alpha=0.2)+
facet_wrap(~LAType)+
labs(title="A rank % achievement",x="",y="")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
rankB<-ggplot(newdata)+
geom_jitter(aes(LAName,`Total%ofInterventionsachieved-premisesratedB`),color="red",alpha=0.2)+
facet_wrap(~LAType)+
labs(title="B rank % achievement",x="",y="")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
rankC<-ggplot(newdata)+
geom_jitter(aes(LAName,`Total%ofInterventionsachieved-premisesratedC`),color="red",alpha=0.2)+
facet_wrap(~LAType)+
labs(title="C rank % achievement",x="",y="")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
rankD<-ggplot(newdata)+
geom_jitter(aes(LAName,`Total%ofInterventionsachieved-premisesratedD`),color="red",alpha=0.2)+
facet_wrap(~LAType)+
labs(title="D rank % achievement",x="",y="")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
rankE<-ggplot(newdata)+
geom_jitter(aes(LAName,`Total%ofInterventionsachieved-premisesratedE`),color="red",alpha=0.2)+
facet_wrap(~LAType)+
labs(title="E rank % achievement",x="",y="")+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Allrank<-ggarrange(rankA, rankB, rankC, rankD, rankE, labels = NULL ,common.legend = TRUE, legend = "bottom",align = "hv",font.label = list(size = 10, color = "black", face = "bold", family = NULL, position = "top"))
annotate_figure(Allrank, left = textGrob("Percentage achieved(%)", rot = 90, vjust = 1, gp = gpar(cex = 1.3)), bottom = textGrob("Each point represents one LA; panels show LA types", gp = gpar(cex = 1.3)))
Here, I categorize the percentage achievement in every rank into 10-point bins (100~90, 90~80, 80~70, etc.) and use histograms to show the distribution, since a histogram shows the overall shape more clearly than a jitter plot. I also rename the columns for quicker coding.
#Rename the intervention columns (19-24) for quicker coding
colnames(newdata)[19:24]<-c("overall","A_rank","B_rank","C_rank","D_rank","E_rank")
#Bin each percentage into 10-point segments: [0,10), [10,20), ..., [90,100]
breaks<-seq(0,100,by=10)
labels<-paste0(head(breaks,-1),"%~",tail(breaks,-1),"%")
newdata<-newdata%>%
  mutate(across(c(overall,A_rank,B_rank,C_rank,D_rank,E_rank),
                ~cut(.,breaks=breaks,labels=labels,right=FALSE,include.lowest=TRUE),
                .names="{.col}_segment"))%>%
  rename(all_rank_segment=overall_segment)
all<-ggplot(newdata,aes(overall))+geom_histogram(fill="cyan",color="black")+xlim(0,101)+
labs(title="The % of enforcement achieved across all ranks",x="% achieved",y="Number of LAs")
all
A<-ggplot(newdata,aes(A_rank))+geom_histogram(fill="red",color="black")+xlim(0,101)+
labs(title="Rank A % achievement",x="% achieved",y="Number of LAs")
B<-ggplot(newdata,aes(B_rank))+geom_histogram(fill="red",color="black")+xlim(0,101)+
labs(title="Rank B % achievement",x="% achieved",y="Number of LAs")
C<-ggplot(newdata,aes(C_rank))+geom_histogram(fill="red",color="black")+xlim(0,101)+
labs(title="Rank C % achievement",x="% achieved",y="Number of LAs")
D<-ggplot(newdata,aes(D_rank))+geom_histogram(fill="red",color="black")+xlim(0,101)+
labs(title="Rank D % achievement",x="% achieved",y="Number of LAs")
E<-ggplot(newdata,aes(E_rank))+geom_histogram(fill="red",color="black")+xlim(0,101)+
labs(title="Rank E % achievement",x="% achieved",y="Number of LAs")
ggarrange(A,B,C,D,E)
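The five near-identical histogram calls above could be collapsed by pivoting the rank columns into long form and faceting. A sketch using simulated percentages (the column names and values here are stand-ins, not the real data):

```r
library(tidyverse)

set.seed(1)
# Simulated stand-in for the renamed rank columns (hypothetical values)
sim <- tibble(
  A_rank = runif(50, 60, 100),
  B_rank = runif(50, 50, 100),
  C_rank = runif(50, 40, 100)
)

# Pivot the rank columns into long form: one row per (LA, rank) pair
long <- sim %>%
  pivot_longer(everything(), names_to = "rank", values_to = "pct")

# A single faceted histogram replaces the separate per-rank plots
p <- ggplot(long, aes(pct)) +
  geom_histogram(bins = 10, fill = "red", colour = "black") +
  facet_wrap(~rank) +
  labs(x = "% achieved", y = "Number of LAs")
```

This also guarantees identical axes and bin widths across panels, which makes the ranks directly comparable.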
Examine whether there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority.
#Firstly, I sum up the interventions and name the column as "total_intervention_achieved".
#For every relationship test, I want to plot them first so we can double check the linear model with the plot.
newdata<-newdata%>%mutate(total_intervention_achieved=rowSums(newdata[,26:35])) #sum up all the interventions and mutate as a column
newdata<-newdata%>%mutate(interventions_achieved_proportion=total_intervention_achieved*100/(`Totalestablishments(includingnotyetrated&outside)`-Establishmentsnotyetratedforintervention)) #Calculate the percentage achieved
ggplot(newdata,aes(`ProfessionalFullTimeEquivalentPosts-occupied *`,interventions_achieved_proportion))+geom_point()+geom_smooth(method="lm")+labs(title = "The relationship between successful responses(%) and the number of FTE employees",x="Number of FTE employees in the LA",y="Proportion of successful responses(%) for each LA") #Plot it
emp<-lm(interventions_achieved_proportion~`ProfessionalFullTimeEquivalentPosts-occupied *`,data=newdata)
summary(emp)
##
## Call:
## lm(formula = interventions_achieved_proportion ~ `ProfessionalFullTimeEquivalentPosts-occupied *`,
## data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.486 -6.962 -0.167 7.088 35.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.6386 1.1804 20.873 < 2e-16 ***
## `ProfessionalFullTimeEquivalentPosts-occupied *` 0.9670 0.2402 4.026 7.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.91 on 321 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.0451
## F-statistic: 16.21 on 1 and 321 DF, p-value: 7.084e-05
#Each additional FTE employee is associated with an increase of roughly 1 percentage point (0.967) in the achievement rate.
#The effect of the number of employees on the achievement rate is statistically significant (t=4.03, p<0.001).
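Alongside the p-value, a confidence interval for the slope makes the "roughly 1 point per employee" claim concrete. A minimal sketch on simulated data (the variables here are invented for illustration, not drawn from the report's dataset):

```r
set.seed(42)
# Simulated stand-in: achievement rises about 1 point per extra employee
employees <- runif(100, 0, 20)
achieved  <- 25 + 1 * employees + rnorm(100, sd = 5)

fit <- lm(achieved ~ employees)

# 95% confidence interval for the slope; an interval excluding 0
# corresponds to significance at the 5% level
ci <- confint(fit, "employees", level = 0.95)
print(ci)
```

The same `confint(emp)` call would report the interval for the fitted model above.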
Examine whether there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority – essentially creating your own measure of how many food safety employees there are per establishment.
newdata<-newdata%>%mutate(employee_rate=`ProfessionalFullTimeEquivalentPosts-occupied *`/(`Totalestablishments(includingnotyetrated&outside)`-Establishmentsnotyetratedforintervention))
Q3<-lm(interventions_achieved_proportion~employee_rate,data = newdata)
Q3
##
## Call:
## lm(formula = interventions_achieved_proportion ~ employee_rate,
## data = newdata)
##
## Coefficients:
## (Intercept) employee_rate
## 24.53 1458.15
summary(Q3) #We can see that the coefficient of employee rate is significant(t-value=2.51, p-value=0.0126<0.05)
##
## Call:
## lm(formula = interventions_achieved_proportion ~ employee_rate,
## data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.204 -6.889 -0.176 7.196 36.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.533 1.776 13.81 <2e-16 ***
## employee_rate 1458.147 580.967 2.51 0.0126 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.08 on 321 degrees of freedom
## Multiple R-squared: 0.01925, Adjusted R-squared: 0.01619
## F-statistic: 6.299 on 1 and 321 DF, p-value: 0.01257
#This means the employees-per-establishment rate does have a significant effect on the proportion of interventions successfully achieved
ggplot(newdata,aes(employee_rate,interventions_achieved_proportion))+geom_point()+geom_smooth(method="lm")+labs(title="Employees per establishment vs. interventions achieved",x="Employees per establishment", y="Intervention achieved rate(%)")
The following report analyses food safety enforcement by the local authorities of England, Wales, and Northern Ireland. The managers of the Food Standards Agency and politicians would like to see how well food safety is being maintained and, where necessary, how it can be improved. Before turning to the specific questions, let us browse the data to understand the current situation. There are 353 observations, one per local authority. However, 30 of them have no usable data (NAs and the placeholder value "NR"), so the analysis covers 323 local authorities across 36 variables. Table 1 shows that England has four types of local authority: "District Council", "London Borough", "Metropolitan Borough Council", and "Unitary Authority", whereas the other two countries each have only one type. District Councils alone account for more than 50% of all local authorities.
| Country | LAType | Number |
|---|---|---|
| England | District Council | 169 |
| England | London Borough | 33 |
| England | Metropolitan Borough Council | 35 |
| England | Unitary Authority | 54 |
| Northern Ireland | NI Unitary Authority | 10 |
| Wales | Welsh Unitary Authority | 22 |
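Table 1 can be reproduced directly with dplyr's `count()`. A sketch on a toy frame (the rows are hypothetical; the real call would use `newdata`):

```r
library(dplyr)

# Toy frame standing in for the cleaned enforcement data (hypothetical rows)
toy <- tibble::tibble(
  Country = c("England", "England", "Wales"),
  LAType  = c("District Council", "District Council", "Welsh Unitary Authority")
)

# One row per Country/LAType combination with its frequency
tab <- toy %>% count(Country, LAType, name = "Number")
print(tab)
```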
The managers of the Food Standards Agency would like to know the percentage of establishments that successfully respond to enforcement actions as a whole. Below is the distribution across all local authorities of the percentage of enforcement actions successfully achieved. The histogram shows the percentage achieved across all ranks from A to E. The distribution is left-skewed, meaning most local authorities have done a good job of getting establishments to meet food safety standards. However, some authorities have not done well, achieving only 50% or less.
However, I suspect there may be differences between establishments rated A, B, C, D, and E, so I plot the distribution for each rank separately. The plot shows fewer LAs in rank A, while rank E seems to contain the most. Since the histograms for the ranks all look similar and offer few further insights, I make jitter plots to examine the distributions more closely.
The jitter plot shows the distribution more clearly. Each point represents one LA, and the transparency indicates the density of LAs at the same level on the y-axis: a pale pink point means few LAs at that level, while a saturated red point means many LAs share it.
I split the plot by LA type to see which type performs worst, so we know where to improve first. The graph shows that in rank A most authorities achieved nearly 100%, but some District Councils did not do well. One possible reason is that there are so many District Councils that it is hard to keep every one performing highly. The same pattern appears in all other ranks except E. The distribution for rank E shows that many authorities did not perform as well on E-rated establishments as on other ranks. I suspect they may regard E-rated establishments as the least dangerous or urgent, and are therefore somewhat less strict when supervising them.
Next, the manager wants to know whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to interventions. To study this, I examine whether there is a relationship between the proportion of successful responses and the number of employees. First, I sum all achieved interventions for each LA and divide by the total number of establishments to obtain the proportion of establishments that achieved the interventions. Then I fit a linear model relating the number of employees to the percentage achieved.
The plot below shows a positive correlation between the number of employees in each LA and its percentage of successful responses. Each additional employee sent to supervise establishments is associated with an increase of roughly 1 percentage point (0.967) in the achievement rate, and this effect is statistically significant (t = 4.03, p < 0.05).
However, a proportion is often more informative than a raw count: some authorities have many employees but few cases, while others have few employees but many cases. So I examine whether there is a relationship between the proportion of successful responses and the number of employees per establishment in the local authority. I divide the number of employees by the total number of establishments to obtain this rate, and fit a model testing its association with the proportion of interventions achieved. The graph shows a clear positive correlation: the higher the employees-per-establishment rate, the higher the proportion of interventions successfully achieved.
| Variable | Description |
|---|---|
| sold by | Company that sold the book |
| publisher.type | Publisher type |
| genre | Genre (childrens, fiction, non_fiction) |
| avg.review | Average review score |
| daily.sales | Daily sales |
| total.reviews | Total number of reviews |
| sale.price | Sale price |
publisher_sales <- read_csv("Desktop/BS_FInal/publisher_sales.csv") #load the data
publisher_sales<-na.omit(publisher_sales) #remove any rows containing NAs
summary(publisher_sales)
## sold by publisher.type genre avg.review daily.sales
## Length:6000 Length:6000 Length:6000 Min. :0.000 Min. : -0.53
## Class :character Class :character Class :character 1st Qu.:4.100 1st Qu.: 56.77
## Mode :character Mode :character Mode :character Median :4.400 Median : 74.29
## Mean :4.267 Mean : 79.11
## 3rd Qu.:4.620 3rd Qu.: 98.02
## Max. :4.980 Max. :207.98
## total.reviews sale.price
## Min. : 0.0 Min. : 0.740
## 1st Qu.:105.0 1st Qu.: 7.140
## Median :128.0 Median : 8.630
## Mean :132.6 Mean : 8.641
## 3rd Qu.:163.0 3rd Qu.:10.160
## Max. :248.0 Max. :17.460
str(publisher_sales)
## tibble [6,000 × 7] (S3: tbl_df/tbl/data.frame)
## $ sold by : chr [1:6000] "Random House LLC" "Amazon Digital Services, Inc." "Amazon Digital Services, Inc." "Amazon Digital Services, Inc." ...
## $ publisher.type: chr [1:6000] "big five" "indie" "small/medium" "small/medium" ...
## $ genre : chr [1:6000] "childrens" "non_fiction" "non_fiction" "fiction" ...
## $ avg.review : num [1:6000] 4.44 4.19 3.71 4.72 4.65 4.81 4.33 4.21 3.95 4.66 ...
## $ daily.sales : num [1:6000] 61.5 74.9 66 85.2 37.7 ...
## $ total.reviews : num [1:6000] 92 130 118 179 111 106 205 86 161 81 ...
## $ sale.price : num [1:6000] 8.03 9.08 9.48 12.32 5.78 ...
Do books from different genres have different daily sales on average?
#Firstly, I want to visualize the whole data for better understanding.
company<-table(publisher_sales$`sold by`)%>%as.data.frame()
Books<-ggplot(company,aes(x=reorder(Var1,-Freq),Freq))+
geom_bar(stat = "identity",color="black",fill="cyan")+labs(title="The variety of books sold by company",x="Company Names",y="The variety of Books")+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Books
#In this chart, we can see that Amazon Digital Services (ADS) sells the widest variety of books, followed by Random House LLC and Penguin Group (USA) LLC.
pub<-table(publisher_sales$publisher.type)%>%as.data.frame()
Books_pub<-ggplot(pub,aes(x=reorder(Var1,-Freq),Freq))+
geom_bar(stat = "identity",color="black",fill="purple")+labs(title="The variety of books by publisher type",x="Publisher Types",y="The variety of Books")+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Books_pub
#In this chart, we can see that small/medium publishers offer the widest variety of books. Surprisingly, the amazon publisher type holds last place.
#To check the difference between genres, we have to categorize and summarize them by genres:
dailysales<-publisher_sales%>%group_by(genre)%>%summarise(sales=mean(daily.sales))
dailysales_pub<-publisher_sales%>%group_by(publisher.type,genre)%>%summarise(sales=mean(daily.sales))
#Then, we visualize the data to get a more straightforward look:
#Below is the plot grouped by genres only:
sales_plot_by_genres<-ggplot(dailysales,aes(x=reorder(genre,-sales),y=sales))+
geom_bar(stat = "identity",color="black",fill="pink")+
geom_text(aes(label=paste(round(sales,2),'books per day'),vjust=2),size=4)+
labs(title="Avg sales by Genres",x="Genres",y="Sales(books per day)")
sales_plot_by_genres
#Below is the plot grouped by genres and publishers:
sales_plot_by_publisher<-ggplot(dailysales_pub,aes(x=reorder(genre,-sales),y=sales,fill=publisher.type))+
geom_bar(stat = "identity",color="black",position = "dodge")+
labs(title="Avg Sales by Genres and Publishers",x="Genres",y="Sales(books per day)")
sales_plot_by_publisher
#Now we can use anova to see the difference in different genres
anova_sales<-aov(daily.sales~genre,data=publisher_sales)
summary(anova_sales)
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 2 2562528 1281264 2590 <2e-16 ***
## Residuals 5997 2966133 495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(anova_sales,~genre)
## genre emmean SE df lower.CL upper.CL
## childrens 55.6 0.497 5997 54.6 56.6
## fiction 105.9 0.497 5997 104.9 106.9
## non_fiction 75.9 0.497 5997 74.9 76.8
##
## Confidence level used: 0.95
#The anova shows that genre has an F-value of 2590 with an extremely small p-value (<2e-16).
#Thus we can conclude that mean daily sales differ significantly between genres.
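The emmeans table gives the three genre means; pairwise differences could also be tested, e.g. with `pairs()` on the emmeans object, or with base R's `TukeyHSD`. A sketch of the latter on simulated data (group means invented to loosely echo the fitted means above):

```r
set.seed(7)
# Simulated daily sales for three genres with clearly different means
sim <- data.frame(
  genre = rep(c("childrens", "fiction", "non_fiction"), each = 40),
  sales = c(rnorm(40, 55, 10), rnorm(40, 105, 10), rnorm(40, 75, 10))
)

fit <- aov(sales ~ genre, data = sim)

# All pairwise mean differences with family-wise adjusted p-values
tk <- TukeyHSD(fit)
print(tk$genre)
```

This tells us not just that the genres differ overall, but which pairs differ and by how much.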
Do books have more or fewer sales depending upon their average review scores and total number of reviews?
#Firstly, I created two lm models, one without interaction term and the other with interaction term between review scores and total reviews.
publisher_sales<-publisher_sales%>%filter(total.reviews!=0 & avg.review!=0) #Filter out books with 0 reviews (their score of 0 reflects having no reviews, not quality) so they do not distort the linear model
nointeraction<-lm(daily.sales~total.reviews+avg.review, data=publisher_sales) #The model without interaction term
interaction<-lm(daily.sales~total.reviews+avg.review+total.reviews*avg.review, data=publisher_sales) #The model with interaction term
summary(interaction)
##
## Call:
## lm(formula = daily.sales ~ total.reviews + avg.review + total.reviews *
## avg.review, data = publisher_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.327 -14.583 -0.773 13.859 93.915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.36525 9.34333 -0.788 0.431
## total.reviews 0.65854 0.06740 9.770 <2e-16 ***
## avg.review 2.61997 2.16273 1.211 0.226
## total.reviews:avg.review -0.02181 0.01560 -1.398 0.162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.21 on 5973 degrees of freedom
## Multiple R-squared: 0.4653, Adjusted R-squared: 0.465
## F-statistic: 1733 on 3 and 5973 DF, p-value: < 2.2e-16
summary(nointeraction)
##
## Call:
## lm(formula = daily.sales ~ total.reviews + avg.review, data = publisher_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.334 -14.632 -0.742 13.839 93.473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.163958 2.651313 1.948 0.0515 .
## total.reviews 0.564926 0.007838 72.077 <2e-16 ***
## avg.review -0.299177 0.565828 -0.529 0.5970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.22 on 5974 degrees of freedom
## Multiple R-squared: 0.4651, Adjusted R-squared: 0.465
## F-statistic: 2598 on 2 and 5974 DF, p-value: < 2.2e-16
#Then, I compared the two models to find out whether the interaction term improves the fit.
anova(nointeraction,interaction) #The interaction term does not significantly improve the model (F=1.96, p-value=0.162>0.05)
## Analysis of Variance Table
##
## Model 1: daily.sales ~ total.reviews + avg.review
## Model 2: daily.sales ~ total.reviews + avg.review + total.reviews * avg.review
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5974 2948425
## 2 5973 2947460 1 965.05 1.9557 0.162
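Besides the F-test, information criteria give another view of whether the interaction earns its extra parameter. A sketch on simulated data generated with no true interaction (variable names and coefficients are illustrative only):

```r
set.seed(3)
n <- 500
reviews <- rpois(n, 130)
score   <- runif(n, 3, 5)
# True model: sales depend on reviews only, with no interaction
sales <- 5 + 0.56 * reviews + rnorm(n, sd = 20)

m0 <- lm(sales ~ reviews + score)
m1 <- lm(sales ~ reviews * score)

# Lower AIC is preferred; the extra interaction parameter must earn its keep
aic <- AIC(m0, m1)
print(aic)
```

The same `AIC(nointeraction, interaction)` comparison could be reported for the fitted models above.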
#In both models the average review score is not significant, so we conclude that it does not affect book sales.
#To double-check, I fit a model with only the average review score, since intuitively a higher score should mean higher sales.
onlyscore<-lm(daily.sales~avg.review,data=publisher_sales)
summary(onlyscore) #Even on its own, the average review score has no meaningful impact on sales
##
## Call:
## lm(formula = daily.sales ~ avg.review, data = publisher_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.858 -22.338 -4.808 18.985 128.932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.7954 3.3371 23.912 <2e-16 ***
## avg.review -0.1618 0.7736 -0.209 0.834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.37 on 5975 degrees of freedom
## Multiple R-squared: 7.317e-06, Adjusted R-squared: -0.00016
## F-statistic: 0.04372 on 1 and 5975 DF, p-value: 0.8344
#A one-point increase in average review score is associated with a decrease of only 0.16 books in daily sales, and the effect is not significant (p-value=0.834>0.05)
#Visualize the data with scatter plots.
publisher_sales$genre<-as.factor(publisher_sales$genre)
ggplot(publisher_sales,aes(total.reviews,daily.sales,color=genre))+geom_point(alpha=0.2)+
labs(title = "Number of reviews vs. daily sales",x="Total number of reviews",y="Average daily sales (books per day)")+geom_smooth(method="lm")
ggplot(publisher_sales,aes(avg.review,daily.sales,color=genre))+geom_point(alpha=0.2)+
labs(title = "Review scores vs. daily sales",x="Average review score",y="Average daily sales (books per day)")+geom_smooth(method="lm")
What is the effect of sale price upon the number of sales, and is this different across genres?
sales_genre<-lm(daily.sales~sale.price+genre,data=publisher_sales) #Without interaction term
sales_genre_interaction<-lm(daily.sales~sale.price+genre+genre*sale.price,data=publisher_sales) #With interaction term
anova(sales_genre,sales_genre_interaction) #Compare two models
## Analysis of Variance Table
##
## Model 1: daily.sales ~ sale.price + genre
## Model 2: daily.sales ~ sale.price + genre + genre * sale.price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5973 2938502
## 2 5971 2928492 2 10010 10.205 3.764e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The ANOVA table shows that the model with the interaction term differs significantly from the model without it (F=10.205, p<.001). Thus the interaction model explains the data better.
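As a sanity check, the F statistic can be reconstructed by hand from the residual sums of squares and degrees of freedom printed in the ANOVA table:

```r
# Values copied from the ANOVA table above
rss1 <- 2938502; df1 <- 5973   # model without interaction
rss2 <- 2928492; df2 <- 5971   # model with interaction
F_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
F_stat  # ~10.205, matching the table
```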
summary(sales_genre_interaction)
##
## Call:
## lm(formula = daily.sales ~ sale.price + genre + genre * sale.price,
## data = publisher_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.407 -13.372 0.039 13.088 102.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.8120 2.5067 29.047 < 2e-16 ***
## sale.price -1.7265 0.2459 -7.021 2.45e-12 ***
## genrefiction 35.2911 3.2793 10.762 < 2e-16 ***
## genrenon_fiction 6.6036 3.2085 2.058 0.03962 *
## sale.price:genrefiction 1.4531 0.3551 4.092 4.34e-05 ***
## sale.price:genrenon_fiction 1.2767 0.3474 3.675 0.00024 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.15 on 5971 degrees of freedom
## Multiple R-squared: 0.4688, Adjusted R-squared: 0.4683
## F-statistic: 1054 on 5 and 5971 DF, p-value: < 2.2e-16
# Main effect of sale price: the number of sales differs with price
# Main effect of genre: the number of sales differs among children's, fiction, and non-fiction books
# Interaction: the effect of sale price on sales differs across genres
summary(sales_genre_interaction.emm <- emmeans(sales_genre_interaction,~genre+sale.price))
## genre sale.price emmean SE df lower.CL upper.CL
## childrens 8.64 57.9 0.597 5971 56.7 59.1
## fiction 8.64 105.7 0.521 5971 104.7 106.8
## non_fiction 8.64 75.5 0.528 5971 74.5 76.6
##
## Confidence level used: 0.95
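Rather than refitting a separate regression per genre (as done below), the per-genre price slopes can also be read straight from the interaction model with `emmeans::emtrends()`. A minimal self-contained sketch on simulated, noise-free data (the variable names mirror the real data set, but the numbers are made up):

```r
library(emmeans)

# Simulated noise-free data with one known price slope per genre
d <- expand.grid(sale.price = 1:15,
                 genre = c("childrens", "fiction", "non_fiction"))
true_slope <- c(childrens = -1.73, fiction = -0.27, non_fiction = -0.45)
d$daily.sales <- 70 + true_slope[as.character(d$genre)] * d$sale.price

m <- lm(daily.sales ~ sale.price * genre, data = d)
emtrends(m, ~ genre, var = "sale.price")  # one price slope (with CI) per genre
```

On the real data, `emtrends(sales_genre_interaction, ~genre, var="sale.price")` would give the three slopes with confidence intervals in one call.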
genre<-lm(daily.sales~genre,data=publisher_sales)
genre
##
## Call:
## lm(formula = daily.sales ~ genre, data = publisher_sales)
##
## Coefficients:
## (Intercept) genrefiction genrenon_fiction
## 55.56 50.35 20.30
price<-lm(daily.sales~sale.price,data = publisher_sales)
price
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = publisher_sales)
##
## Coefficients:
## (Intercept) sale.price
## 112.093 -3.818
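Using the coefficients of this overall fit, predicted average daily sales at a given price are just points on the fitted line. A hypothetical helper, hard-coding the coefficients printed above:

```r
# Coefficients copied from the overall model printed above
predict_sales <- function(price) 112.093 - 3.818 * price
predict_sales(c(5, 10, 15))  # 93.003 73.913 54.823
```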
#Split the data by genre and fit a separate regression in each subset to see the price-sales relationship
g_fiction<-publisher_sales%>%filter(genre=="fiction")
g_nonfiction<-publisher_sales%>%filter(genre=="non_fiction")
g_children<-publisher_sales%>%filter(genre=="childrens")
lm(daily.sales~sale.price,data=g_fiction) #When the sale price goes up by £1, fiction sales decrease by 0.2734 books
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = g_fiction)
##
## Coefficients:
## (Intercept) sale.price
## 108.1030 -0.2734
lm(daily.sales~sale.price,data=g_nonfiction) #When the sale price goes up by £1, non-fiction sales decrease by 0.4498 books
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = g_nonfiction)
##
## Coefficients:
## (Intercept) sale.price
## 79.4156 -0.4498
lm(daily.sales~sale.price,data=g_children) #When the sale price goes up by £1, children's book sales decrease by 1.727 books
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = g_children)
##
## Coefficients:
## (Intercept) sale.price
## 72.812 -1.727
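The per-genre slopes above are consistent with the interaction model: each genre's slope is the base `sale.price` coefficient plus that genre's interaction term. A quick check, with values copied from `summary(sales_genre_interaction)`:

```r
# Coefficients copied from summary(sales_genre_interaction)
b_price  <- -1.7265   # slope for the reference genre (childrens)
b_fic    <-  1.4531   # sale.price:genrefiction
b_nonfic <-  1.2767   # sale.price:genrenon_fiction
c(childrens   = b_price,              # -1.7265, matches the g_children fit
  fiction     = b_price + b_fic,      # -0.2734, matches the g_fiction fit
  non_fiction = b_price + b_nonfic)   # -0.4498, matches the g_nonfiction fit
```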
EMM<-ggplot(summary(sales_genre_interaction.emm), mapping=aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL))+
geom_point(color="red")+
geom_linerange(aes(ymin=lower.CL, ymax=upper.CL))+ylim(c(55,110))+
labs(y="Estimated Sales(Number of books)", x="Genres",
subtitle="Error bars are 95% CIs",
title="Difference in Sales across Genres")+
geom_hline(mapping=aes(yintercept=lower.CL), lty=2)+
geom_hline(mapping=aes(yintercept=upper.CL), lty=2)
EMM
Effect<-ggplot(publisher_sales,aes(sale.price,daily.sales,color=genre))+
geom_point(alpha=0.2)+
geom_smooth(method="lm")+
labs(title="The effect of sale price upon the number of sales and genres",x="Book Price(£)",y="Average Daily Sales(books/day)")
Effect
Separate<-ggplot(publisher_sales,aes(sale.price,daily.sales))+
geom_point(color="red",alpha=0.2)+
geom_smooth(method="lm")+
labs(title="The effect of sale price upon the number of sales, by genre",x="Book Price (£)",y="Average Daily Sales(books/day)")+facet_wrap(genre~.)
Separate
The following report analyses several months of e-book sales data from multiple sellers, including Amazon Digital Services, Random House LLC, and HarperCollins Publishers. The managers of a publishing company want to know how average daily sales relate to genre, book price (£), and review scores. The report walks the reader through the data and addresses these questions in turn.
Let’s take a brief look at the data first. Table 1 shows that Amazon Digital Services dominates the variety of e-books sold in the market, followed by Random House LLC and Penguin Group (USA) LLC. The bar plot in Figure 1 shows this more clearly. Two sellers have only one kind of e-book on sale; however, this does not affect the following analysis, since we only want to know the relationships between book sales, book prices, review scores, and genres, so we keep those rows.
| # | Organization | Variety |
|---|---|---|
| 1 | Amazon Digital Services, Inc. | 4311 |
| 11 | Random House LLC | 442 |
| 10 | Penguin Group (USA) LLC | 307 |
| 13 | Simon and Schuster Digital Sales Inc | 272 |
| 6 | HarperCollins Publishers | 261 |
| 9 | Macmillan | 175 |
| 4 | Hachette Book Group | 149 |
| 5 | HarperCollins Christian Publishing | 28 |
| 3 | DC Comics | 24 |
| 7 | HarperCollins Publishing | 24 |
| 8 | Idea & Design Works | 5 |
| 2 | Cengage Learning | 1 |
| 12 | RCS MediaGroup S.p.A. | 1 |
Then, looking at Figure 2 below, we find that the variety of books is not dominated by one kind of publisher; surprisingly, Amazon is in last place. Small/medium publishers offer the widest range of e-books (more than 2,000 titles).
Next, we dive deeper into the variables we care about most: book sales, prices (£), review scores, and genres.
The first question: there are many book genres and people have their favourites, but do some genres simply sell better or worse than others? In other words, do books from different genres have different daily sales on average?
The following figures show average sales by genre, and by genre and publisher. Fiction is clearly the best-selling genre, with 105.89 books sold per day on average, followed by non-fiction with 75.87, and children's books last with only 55.58. A statistical test (F=2585, p<.05) confirms that fiction, non-fiction, and children's books have significantly different sales, with fiction the highest. The "Difference in Sales across Genres" plot shows the estimated sales for each genre; none of the confidence intervals overlap, so the three genres' sales are clearly distinct. One plausible, if speculative, explanation: with declining birth rates worldwide, children's e-books sell worst, while fast-developing technology has made readers more fascinated by fiction films and novels, so fiction e-books sell best.
Drilling down within genres, publishers have roughly the same sales within a given genre. This suggests that, for a given genre, the publisher makes little difference to book sales.
Secondly, when people search for the nearest restaurant or the most popular place to visit, they browse the comments and review scores to decide where to go. Does the same behaviour apply to buying e-books? In other words, do e-books have more or fewer sales depending upon their average review scores and total number of reviews?
The graph below shows the relationship clearly. The correlation between number of reviews and daily sales is positive, and it holds in every genre: the more reviews a book has, the more copies it sells. However, this does not mean that a book with more reviews will necessarily become a best seller; correlation does not imply causation.
The next graph shows the relationship between review scores and daily sales across all genres. There is no significant correlation (F=0.044, p=0.834>0.05) between review scores and sales, which is counter-intuitive. One possible reason is that the data points are heavily concentrated between scores of 4 and 5 and sales of 50 to 150 books, so any relationship is very weak.
Finally, when people buy things, price is usually their first consideration. Does this principle also apply to e-books? We therefore ask: what is the effect of sale price upon the number of sales, and does it differ across genres? The plot below shows the relationship between sale price and sales across all genres. When the sale price increases, sales decrease: for every £1 increase in price, overall daily sales fall by 3.818 books. There is thus a negative correlation between sale price and the number of sales.
However, once we take genre into account, fiction and non-fiction books show no significant relationship between sale price and sales. For fiction, a £1 increase in price reduces daily sales by only 0.2734 books; for non-fiction, by 0.4498 books. Children's books, in contrast, do show a significant negative relationship: a £1 increase in price reduces daily sales by 1.727 books. A possible explanation is that parents buy e-books for children who may lose interest quickly and never reread them, so parents are not willing to pay much for them.
Conclusion: the report shows that e-book sales differ across genres (fiction, non-fiction, children's). Fiction e-books sell best and children's e-books worst. The number of reviews has a positive correlation with the number of books sold, but the review score does not. Finally, book price has a negative correlation with sales overall; interestingly, this negative relationship is driven by children's books, while fiction and non-fiction sales show no significant relationship with price.